This lesson covers a collection of packages, primarily by Hadley Wickham, for tidying data and then working with it in tidy form.

The packages we are using in this lesson are all from CRAN, so we can install them with install.packages.

install.packages(c(
    # Hadley Wickham packages
    "readr",    # read tabular data
    "tidyr",    # data frame tidying functions
    "dplyr",    # general data frame manipulation
    "ggplot2",  # flexible plotting
    
    # Viridis color scale 
    "viridis",

    # (Advanced!) Tidy linear modelling    
    "broom"
))
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(viridis)
library(broom)

These packages usually have useful documentation in the form of “vignettes”. These are readable on the CRAN website, or within R:

vignette()
vignette(package="dplyr")
vignette("introduction", package="dplyr")

dplyr

Let’s continue our examination of the FastQC output. If you’re starting fresh for this lesson, you can load the necessary data frame with:

bigtab <- read_csv("r-more-files/fastqc.csv")
bigtab$grade <- factor(bigtab$grade, c("FAIL","WARN","PASS"))

bigtab
## # A tibble: 72 x 3
##     grade                         test       file
##    <fctr>                        <chr>      <chr>
## 1    PASS             Basic Statistics Day0.fastq
## 2    PASS    Per base sequence quality Day0.fastq
## 3    PASS    Per tile sequence quality Day0.fastq
## 4    PASS  Per sequence quality scores Day0.fastq
## 5    FAIL    Per base sequence content Day0.fastq
## 6    WARN      Per sequence GC content Day0.fastq
## 7    PASS           Per base N content Day0.fastq
## 8    PASS Sequence Length Distribution Day0.fastq
## 9    PASS  Sequence Duplication Levels Day0.fastq
## 10   WARN    Overrepresented sequences Day0.fastq
## # ... with 62 more rows

bigtab is a “tibble”, which is Hadley Wickham’s slightly improved data frame. One convenient feature is that it only shows a few rows of the data frame when you print it. If you do want to see the whole table, you can use as.data.frame, or use the View function to view it in a tab.

as.data.frame(bigtab)
View(bigtab)

dplyr gives us a collection of convenient “verbs” for manipulating data frames.

filter

Say we want to know all the tests that failed. filter provides a convenient way to get these.

filter(bigtab, grade == "FAIL")
## # A tibble: 8 x 3
##    grade                      test        file
##   <fctr>                     <chr>       <chr>
## 1   FAIL Per base sequence content  Day0.fastq
## 2   FAIL Per base sequence content  Day4.fastq
## 3   FAIL Per base sequence content  Day7.fastq
## 4   FAIL   Per sequence GC content  Day7.fastq
## 5   FAIL Per base sequence content Day10.fastq
## 6   FAIL Per base sequence content Day15.fastq
## 7   FAIL Per base sequence content Day20.fastq
## 8   FAIL   Per sequence GC content Day20.fastq

Something a little magical is happening here: grade is not a variable in our environment. It exists only as a column within the data frame, and filter knows to look for it there.
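This "non-standard evaluation" is roughly equivalent to base R subsetting, where the column must be named explicitly. A small sketch of the equivalence, using an invented toy tibble:

```r
library(dplyr)

# An invented toy data frame for illustration.
toy <- tibble(grade = c("PASS", "FAIL", "WARN"),
              test  = c("a", "b", "c"))

filter(toy, grade == "FAIL")    # dplyr: grade is found inside toy
toy[toy$grade == "FAIL", ]      # base R: the column is named explicitly
```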

arrange

Rather than filtering, we might instead want to sort the data so the most important rows are at the top. arrange sorts a data frame by one or more columns.

arrange(bigtab, grade)
## # A tibble: 72 x 3
##     grade                      test        file
##    <fctr>                     <chr>       <chr>
## 1    FAIL Per base sequence content  Day0.fastq
## 2    FAIL Per base sequence content  Day4.fastq
## 3    FAIL Per base sequence content  Day7.fastq
## 4    FAIL   Per sequence GC content  Day7.fastq
## 5    FAIL Per base sequence content Day10.fastq
## 6    FAIL Per base sequence content Day15.fastq
## 7    FAIL Per base sequence content Day20.fastq
## 8    FAIL   Per sequence GC content Day20.fastq
## 9    WARN   Per sequence GC content  Day0.fastq
## 10   WARN Overrepresented sequences  Day0.fastq
## # ... with 62 more rows
# desc( ) can be used to reverse the sort order
arrange(bigtab, desc(grade))
## # A tibble: 72 x 3
##     grade                         test       file
##    <fctr>                        <chr>      <chr>
## 1    PASS             Basic Statistics Day0.fastq
## 2    PASS    Per base sequence quality Day0.fastq
## 3    PASS    Per tile sequence quality Day0.fastq
## 4    PASS  Per sequence quality scores Day0.fastq
## 5    PASS           Per base N content Day0.fastq
## 6    PASS Sequence Length Distribution Day0.fastq
## 7    PASS  Sequence Duplication Levels Day0.fastq
## 8    PASS              Adapter Content Day0.fastq
## 9    PASS                 Kmer Content Day0.fastq
## 10   PASS             Basic Statistics Day4.fastq
## # ... with 62 more rows

select

dplyr’s select function is for subsetting, renaming, and reordering columns. Old columns can be referred to by name or by number.

select(bigtab, test,grade) 
## # A tibble: 72 x 2
##                            test  grade
##                           <chr> <fctr>
## 1              Basic Statistics   PASS
## 2     Per base sequence quality   PASS
## 3     Per tile sequence quality   PASS
## 4   Per sequence quality scores   PASS
## 5     Per base sequence content   FAIL
## 6       Per sequence GC content   WARN
## 7            Per base N content   PASS
## 8  Sequence Length Distribution   PASS
## 9   Sequence Duplication Levels   PASS
## 10    Overrepresented sequences   WARN
## # ... with 62 more rows
select(bigtab, 2,1)
## # A tibble: 72 x 2
##                            test  grade
##                           <chr> <fctr>
## 1              Basic Statistics   PASS
## 2     Per base sequence quality   PASS
## 3     Per tile sequence quality   PASS
## 4   Per sequence quality scores   PASS
## 5     Per base sequence content   FAIL
## 6       Per sequence GC content   WARN
## 7            Per base N content   PASS
## 8  Sequence Length Distribution   PASS
## 9   Sequence Duplication Levels   PASS
## 10    Overrepresented sequences   WARN
## # ... with 62 more rows
select(bigtab, foo=file, bar=test, baz=grade)
## # A tibble: 72 x 3
##           foo                          bar    baz
##         <chr>                        <chr> <fctr>
## 1  Day0.fastq             Basic Statistics   PASS
## 2  Day0.fastq    Per base sequence quality   PASS
## 3  Day0.fastq    Per tile sequence quality   PASS
## 4  Day0.fastq  Per sequence quality scores   PASS
## 5  Day0.fastq    Per base sequence content   FAIL
## 6  Day0.fastq      Per sequence GC content   WARN
## 7  Day0.fastq           Per base N content   PASS
## 8  Day0.fastq Sequence Length Distribution   PASS
## 9  Day0.fastq  Sequence Duplication Levels   PASS
## 10 Day0.fastq    Overrepresented sequences   WARN
## # ... with 62 more rows

You can also remove specific columns like this:

select(bigtab, -file)
## # A tibble: 72 x 2
##     grade                         test
##    <fctr>                        <chr>
## 1    PASS             Basic Statistics
## 2    PASS    Per base sequence quality
## 3    PASS    Per tile sequence quality
## 4    PASS  Per sequence quality scores
## 5    FAIL    Per base sequence content
## 6    WARN      Per sequence GC content
## 7    PASS           Per base N content
## 8    PASS Sequence Length Distribution
## 9    PASS  Sequence Duplication Levels
## 10   WARN    Overrepresented sequences
## # ... with 62 more rows

Joins

Say we want to convert PASS/WARN/FAIL into a numeric score, so we can produce some numerical summaries. The scoring scheme will be:

fwp <- c("FAIL","WARN","PASS")
scoring <- tibble(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))

# Or: 
# scoring <- data.frame(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))

scoring
## # A tibble: 3 x 2
##    grade score
##   <fctr> <dbl>
## 1   FAIL   0.0
## 2   WARN   0.5
## 3   PASS   1.0

On the introductory R day, we saw the merge function. dplyr has its own version of this in the form of several functions: left_join, right_join, inner_join, and full_join.

The difference between these functions is what happens when there is a row in one data frame without a corresponding row in the other data frame. inner_join discards such rows. full_join always keeps them, filling in missing data with NA. left_join always keeps rows from the first data frame. right_join always keeps rows from the second data frame.
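A toy example may make the difference concrete (the data frames here are invented for illustration; "y" appears in both, "x" only in the first, "z" only in the second):

```r
library(dplyr)

a <- tibble(key = c("x", "y"), a_val = c(1, 2))
b <- tibble(key = c("y", "z"), b_val = c(10, 20))

inner_join(a, b, by="key")   # 1 row:  just "y"
left_join(a, b, by="key")    # 2 rows: "x" and "y", with b_val NA for "x"
right_join(a, b, by="key")   # 2 rows: "y" and "z", with a_val NA for "z"
full_join(a, b, by="key")    # 3 rows: "x", "y", "z", with NAs filling gaps
```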

Often we use a join to augment a data frame with some extra information. left_join is a good default choice for this as it will never delete rows from the data frame that is being augmented.

scoretab <- left_join(bigtab, scoring, by="grade")
scoretab
## # A tibble: 72 x 4
##     grade                         test       file score
##    <fctr>                        <chr>      <chr> <dbl>
## 1    PASS             Basic Statistics Day0.fastq   1.0
## 2    PASS    Per base sequence quality Day0.fastq   1.0
## 3    PASS    Per tile sequence quality Day0.fastq   1.0
## 4    PASS  Per sequence quality scores Day0.fastq   1.0
## 5    FAIL    Per base sequence content Day0.fastq   0.0
## 6    WARN      Per sequence GC content Day0.fastq   0.5
## 7    PASS           Per base N content Day0.fastq   1.0
## 8    PASS Sequence Length Distribution Day0.fastq   1.0
## 9    PASS  Sequence Duplication Levels Day0.fastq   1.0
## 10   WARN    Overrepresented sequences Day0.fastq   0.5
## # ... with 62 more rows

“grade” is here acting as the key, allowing corresponding rows in the data frames to be joined together.

One important thing that all the join functions do: if multiple rows have the same key in either data frame, all ways of combining the two sets of rows will be included in the result. So, here, rows from the scoring data frame have been copied many times in the output.
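A minimal sketch of this behaviour with invented data, where one key matches several rows:

```r
library(dplyr)

lookup  <- tibble(grade = c("PASS", "FAIL"), score = c(1, 0))
results <- tibble(grade = c("PASS", "PASS", "FAIL"))

# The single PASS row of lookup is copied once for each matching
# PASS row of results, giving 3 rows in total.
left_join(results, lookup, by="grade")
```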

Joins are the key tool for integrating different types of data, based on some common key such as gene id.

mutate

mutate lets us add or overwrite columns by computing a new value for them.

mutate(scoretab, doublescore = score*2)
## # A tibble: 72 x 5
##     grade                         test       file score doublescore
##    <fctr>                        <chr>      <chr> <dbl>       <dbl>
## 1    PASS             Basic Statistics Day0.fastq   1.0           2
## 2    PASS    Per base sequence quality Day0.fastq   1.0           2
## 3    PASS    Per tile sequence quality Day0.fastq   1.0           2
## 4    PASS  Per sequence quality scores Day0.fastq   1.0           2
## 5    FAIL    Per base sequence content Day0.fastq   0.0           0
## 6    WARN      Per sequence GC content Day0.fastq   0.5           1
## 7    PASS           Per base N content Day0.fastq   1.0           2
## 8    PASS Sequence Length Distribution Day0.fastq   1.0           2
## 9    PASS  Sequence Duplication Levels Day0.fastq   1.0           2
## 10   WARN    Overrepresented sequences Day0.fastq   0.5           1
## # ... with 62 more rows

This is equivalent to

scoretab2 <- scoretab
scoretab2$doublescore <- scoretab2$score * 2

summarize

summarize lets us compute summaries of data.

summarize(scoretab, total=sum(score))
## # A tibble: 1 x 1
##   total
##   <dbl>
## 1    60

We really want to summarize the data grouped by file, so we can see if there are any particularly bad files. This is achieved using the group_by “adjective”. group_by gives the data frame a grouping using one or more columns, which modifies the subsequent call to summarize.

group_by(scoretab, file)
## Source: local data frame [72 x 4]
## Groups: file [6]
## 
##     grade                         test       file score
##    <fctr>                        <chr>      <chr> <dbl>
## 1    PASS             Basic Statistics Day0.fastq   1.0
## 2    PASS    Per base sequence quality Day0.fastq   1.0
## 3    PASS    Per tile sequence quality Day0.fastq   1.0
## 4    PASS  Per sequence quality scores Day0.fastq   1.0
## 5    FAIL    Per base sequence content Day0.fastq   0.0
## 6    WARN      Per sequence GC content Day0.fastq   0.5
## 7    PASS           Per base N content Day0.fastq   1.0
## 8    PASS Sequence Length Distribution Day0.fastq   1.0
## 9    PASS  Sequence Duplication Levels Day0.fastq   1.0
## 10   WARN    Overrepresented sequences Day0.fastq   0.5
## # ... with 62 more rows
summarize(group_by(scoretab, file), total=sum(score))
## # A tibble: 6 x 2
##          file total
##         <chr> <dbl>
## 1  Day0.fastq  10.0
## 2 Day10.fastq  10.0
## 3 Day15.fastq  10.0
## 4 Day20.fastq   9.5
## 5  Day4.fastq  10.5
## 6  Day7.fastq  10.0

The special function n() can be used within summarize to get the number of rows. (This also works in mutate, but is most useful in summarize.)

summarize(group_by(bigtab, grade), count=n())
## # A tibble: 3 x 2
##    grade count
##   <fctr> <int>
## 1   FAIL     8
## 2   WARN     8
## 3   PASS    56

So summarize lets us do the things we did with table and tapply, but staying tidy by using data frames.
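For comparison, base R versions of these two summaries might look like this (a sketch using an invented miniature stand-in for scoretab; note that table and tapply return vectors rather than tidy data frames):

```r
# A tiny invented stand-in for scoretab.
scoretab_demo <- data.frame(
    grade = c("PASS", "WARN", "PASS", "FAIL"),
    file  = c("a", "a", "b", "b"),
    score = c(1, 0.5, 1, 0))

table(scoretab_demo$grade)                           # counts per grade
tapply(scoretab_demo$score, scoretab_demo$file, sum) # total score per file
```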

Tip: group_by also affects other verbs, such as mutate. This is more advanced than we will be going in to today. Grouping can be removed with ungroup.

See also count and distinct, which cover common uses of summarize.
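For example, count collapses the group_by/summarize/n() pattern into a single call (shown here on an invented toy tibble):

```r
library(dplyr)

toy <- tibble(grade = c("PASS", "PASS", "FAIL"))

count(toy, grade)                         # shorthand for the line below
summarize(group_by(toy, grade), n = n())
distinct(toy)                             # just the unique rows
```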

The pipe %>%

We often want to string together a series of dplyr functions. This is achieved using the pipe operator, %>%, which dplyr makes available (it originally comes from the magrittr package). This takes the value on the left, and passes it as the first argument to the function call on the right. So the previous summarization could be written:

bigtab %>% group_by(grade) %>% summarize(count=n())

%>% isn’t limited to dplyr functions. It’s an alternative way of writing almost any R function call.

rep("hello", 5)
## [1] "hello" "hello" "hello" "hello" "hello"
"hello" %>% rep(5)
## [1] "hello" "hello" "hello" "hello" "hello"

ggplot2 revisited

This section builds on what we saw of ggplot2 in the introductory R day. Recall that we could assign columns of a data frame to aesthetics (x and y position, color, and so on), and then add “geoms” to draw the data.

The idea of adding geoms is rather like the dplyr pipe. ggplot2 predates dplyr, and Hadley Wickham has had a progression of ideas. It will probably be possible to use %>% instead of + in the not too distant future.

With ggplot2 we don’t need to summarize bigtab, we can just view the whole data set.

ggplot(bigtab,aes(x=file,y=test,color=grade)) + geom_point()

With categorical data on the x and y axes, an even better geom to use is geom_tile.

ggplot(bigtab,aes(x=file,y=test,fill=grade)) + geom_tile()

Publication quality images

ggplot2 offers a very wide array of ways to adjust a plot.

plot <- ggplot(bigtab,aes(x=file,y=test,fill=grade)) + 
    geom_tile(color="black",size=0.5) +
    labs(x="",y="",fill="") +                 # Remove axis labels
    coord_fixed() +                           # Square tiles
    theme_minimal() +                         # Minimal theme, no grey background
    theme(panel.grid=element_blank(),         # No underlying grid lines
          axis.text.x=element_text(           # Vertical text on x axis
              angle=90,vjust=0.5,hjust=1))              
plot

We would ideally also reorder the x and y axes meaningfully, and refine the x axis labels.

Plots can be saved with ggsave. Width and height are given in inches, and an image resolution in Dots Per Inch should also be given. The width and height will determine the relative size of text and graphics, so increasing the resolution is best done by adjusting the DPI. Compare the files produced by:

ggsave("plot1.png", plot, width=5,  height=5,  dpi=600)
ggsave("plot2.png", plot, width=10, height=10, dpi=300)

It may be necessary to edit a plot further in a program such as Inkscape or Adobe Illustrator. To allow this, the image needs to be saved in a “vector” format such as SVG, EPS, or PDF. In this case, the DPI argument isn’t needed.
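For example, to save a PDF version (a sketch; the plot here is a trivial placeholder so the example is self-contained):

```r
library(ggplot2)

# A trivial placeholder plot, just so the example runs on its own.
p <- ggplot(data.frame(x=1:3, y=1:3), aes(x, y)) + geom_point()

# The vector format is chosen by file extension; no dpi argument is needed.
ggsave("plot.pdf", p, width=5, height=5)
```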

A simple RNA-Seq example

We will now look at a small gene expression data set. To simplify somewhat: the data was generated from a multiplex PCR targeting 44 genes. The PCR product is quantified by counting reads from high throughput sequencing. The experiment is of yeast released synchronously into cell cycling, so that we can see which genes are active at different points in the cycle. Several strains of yeast have been used: a wildtype and several gene knock-downs. Each has nine time points, covering two cell cycles (two hours).

This is much smaller than a typical RNA-Seq dataset. We are only looking at 44 genes, rather than all of the thousands of genes in the yeast genome.

Recall the three activities involved in data analysis: tidying, transformation, and visualization. This data set requires all of them.

Tidying

Hint: You can select from the start of a pipeline to partway through it, then press Ctrl-Enter or Command-Enter to see the intermediate value.

# Use readr's read_csv function to load the file
counts_untidy <- read_csv("r-more-files/read-counts.csv")

counts <- counts_untidy %>%
    gather(sample, count, -Feature, factor_key=TRUE) %>%
    separate(sample, sep=":", into=c("strain","time"), convert=TRUE, remove=FALSE) %>%
    mutate(
        strain = factor(strain, levels=c("WT","SET1","RRP6","SET1-RRP6")),
        set1 = (strain == "SET1" | strain == "SET1-RRP6"),
        rrp6 = (strain == "RRP6" | strain == "SET1-RRP6")
    ) %>%
    filter(time >= 2) %>%
    select(sample, strain, set1, rrp6, time, gene=Feature, count)

summary(counts)
##      sample           strain       set1            rrp6        
##  WT:2   :  45   WT       :405   Mode :logical   Mode :logical  
##  WT:3   :  45   SET1     :405   FALSE:810       FALSE:810      
##  WT:4   :  45   RRP6     :405   TRUE :810       TRUE :810      
##  WT:5   :  45   SET1-RRP6:405   NA's :0         NA's :0        
##  WT:6   :  45                                                  
##  WT:7   :  45                                                  
##  (Other):1350                                                  
##       time        gene               count       
##  Min.   : 2   Length:1620        Min.   :     0  
##  1st Qu.: 4   Class :character   1st Qu.:   172  
##  Median : 6   Mode  :character   Median :   666  
##  Mean   : 6                      Mean   :  3612  
##  3rd Qu.: 8                      3rd Qu.:  2548  
##  Max.   :10                      Max.   :143592  
## 

Here we’ve used two tidying functions from tidyr:

  • gather gathers columns into a key column and a value column. It’s similar to reshape2’s melt function.

  • separate splits character columns by a separator string (actually, a regular expression), creating new columns.

“Time point 1” was omitted because it isn’t actually part of the time series; it’s from cells in an unsynchronized state. If we wanted to be even tidier, the remaining time points could be recoded with the actual time within the experiment – they’re at 15-minute intervals.
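A sketch of that recoding, assuming (hypothetically) that time point 2 is the zero-minute sample and that points are 15 minutes apart:

```r
library(dplyr)

# Hypothetical mapping: time point 2 -> 0 minutes, 3 -> 15 minutes, etc.
demo <- tibble(time = c(2, 3, 4, 10))
mutate(demo, minutes = (time - 2) * 15)
```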

Transformation and normalization

ggplot(counts, aes(x=sample, y=count)) + geom_boxplot() + coord_flip()

We immediately see from a Tukey box plot that the data is far from normal – there are outliers many box-widths to the right of the box. Perhaps a log transformation is in order.

ggplot(counts, aes(x=sample, y=log2(count))) + geom_boxplot() + coord_flip()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

ggplot(counts, aes(x=log2(count), group=sample)) + geom_line(stat="density")
## Warning: Removed 14 rows containing non-finite values (stat_density).

Log transformation has greatly improved things, although now there are more outliers left of the whiskers, and some zeros we will need to be careful of. We also see that one of the samples has fewer reads than the others.

The gene SRP68 was included as a control housekeeping gene. Its expression level should be the same in all samples. We will divide counts in each sample by the count for SRP68 to correct for library size, then log transform the data. We add a small constant when log transforming so that zeros do not become negative infinity.

normalizer <- counts %>%
    filter(gene == "SRP68") %>%
    select(sample, norm=count)

moderation <- 0.5/mean(normalizer$norm)

counts_norm <- counts %>%
    left_join(normalizer, by="sample") %>%
    mutate(
        norm_count = count/norm, 
        log_norm_count = log2(norm_count+moderation)
    )

ggplot(counts_norm, aes(x=sample, y=log_norm_count)) + geom_boxplot() + coord_flip()

Visualization

ggplot(counts_norm, aes(x=time, y=gene, fill=log_norm_count)) + 
    geom_tile() + facet_grid(~ strain) + 
    scale_fill_viridis() + theme_minimal()

ggplot(counts_norm, aes(x=time, y=strain, fill=log_norm_count)) + 
    geom_tile() + facet_wrap(~ gene) + 
    scale_fill_viridis() + theme_minimal()

ggplot(counts_norm, aes(x=time, y=log_norm_count, color=strain, group=strain)) + 
    geom_line() + facet_wrap(~ gene, scale="free")

Challenge

  1. Make a line plot as above but just showing the gene HHF1.

Hint: intToUtf8(utf8ToInt("vtf!gjmufs")-1)

  2. Which are the three most variable genes?

Hint: intToUtf8(utf8ToInt("xvh#jurxsbe|/#vxppdul}h/#vg/#dqg#duudqjh")-3)

  3. Different genes have different average expression levels, but what we are interested in is how they change over time. Further normalize the data by subtracting the average for each gene from log_norm_count.

Linear modelling

The idea of this section is to give a taste of what is possible, so that you know what topics to read up on or ask an expert for help with if this is the sort of analysis you need. It will also provide a little context for advanced forms of RNA-Seq differential expression analysis.

sut476 <- counts_norm %>% filter(gene=="SUT476") 
sut476_wt <- sut476 %>% filter(strain=="WT")

ggplot(sut476_wt,aes(x=time,y=log_norm_count)) +
    geom_point() +
    geom_smooth(method="lm")

ggplot2 includes a handy geom for fitting a curve or line to data, but what exactly is it doing? What if we want to work with that fitted curve ourselves? ggplot2 can use various curve fitting methods, the simplest of which is a linear model with “lm”.

model <- lm(log_norm_count ~ time, data=sut476_wt)
model
## 
## Call:
## lm(formula = log_norm_count ~ time, data = sut476_wt)
## 
## Coefficients:
## (Intercept)         time  
##    -5.21805      0.08813
# Note: The p-values listed by summary aren't always meaningful.
# It makes no sense to remove the intercept term but retain the time term.
# Significance testing can better be done with anova() and the glht package.
summary(model)
## 
## Call:
## lm(formula = log_norm_count ~ time, data = sut476_wt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9458 -0.4717 -0.1152  0.2420  1.5235 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.21805    0.66749  -7.817 0.000106 ***
## time         0.08813    0.10219   0.862 0.417017    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7915 on 7 degrees of freedom
## Multiple R-squared:  0.09605,    Adjusted R-squared:  -0.03309 
## F-statistic: 0.7438 on 1 and 7 DF,  p-value: 0.417
model.matrix(log_norm_count ~ time, data=sut476_wt)
##   (Intercept) time
## 1           1    2
## 2           1    3
## 3           1    4
## 4           1    5
## 5           1    6
## 6           1    7
## 7           1    8
## 8           1    9
## 9           1   10
## attr(,"assign")
## [1] 0 1

A model can be tested against a simpler null model using anova. Is the fit sufficiently better to justify the extra complexity?

null_model <- lm(log_norm_count ~ 1, data=sut476_wt)
anova(null_model, model)
## Analysis of Variance Table
## 
## Model 1: log_norm_count ~ 1
## Model 2: log_norm_count ~ time
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1      8 4.8518                           
## 2      7 4.3858  1   0.46601 0.7438  0.417

In comparison to the null hypothesis that expression is constant, this data does not support a linear trend. Also useful for hypothesis testing, the multcomp package can be used to test comparisons of interest, such as a set of pairwise comparisons. This is similar to testing of contrasts in packages for differential expression analysis such as limma, edgeR, or DESeq2.

augment from the broom package gives various diagnostic statistics.

augment(model, sut476_wt) %>% head
##   sample strain  set1  rrp6 time   gene count norm norm_count
## 1   WT:2     WT FALSE FALSE    2 SUT476     8   92 0.08695652
## 2   WT:3     WT FALSE FALSE    3 SUT476    10  503 0.01988072
## 3   WT:4     WT FALSE FALSE    4 SUT476    28  989 0.02831143
## 4   WT:5     WT FALSE FALSE    5 SUT476    23 1236 0.01860841
## 5   WT:6     WT FALSE FALSE    6 SUT476    31 1122 0.02762923
## 6   WT:7     WT FALSE FALSE    7 SUT476    47 1054 0.04459203
##   log_norm_count   .fitted   .se.fit     .resid      .hat    .sigma
## 1      -3.518249 -5.041789 0.4865111  1.5235400 0.3777778 0.3304857
## 2      -5.629393 -4.953660 0.4044709 -0.6757333 0.2611111 0.7924442
## 3      -5.126217 -4.865530 0.3337439 -0.2606864 0.1777778 0.8468699
## 4      -5.723242 -4.777401 0.2829451 -0.9458414 0.1277778 0.7483436
## 5      -5.161007 -4.689271 0.2638477 -0.4717353 0.1111111 0.8302040
## 6      -4.476729 -4.601142 0.2829451  0.1244130 0.1277778 0.8532328
##      .cooksd .std.resid
## 1 1.80748195  2.4400939
## 2 0.17427640 -0.9931416
## 3 0.01426122 -0.3632028
## 4 0.11991088 -1.2794703
## 5 0.02497354 -0.6321208
## 6 0.00207469  0.1682974
# augment() can also be used to produce predictions for new inputs
augment(model, newdata=sut476_wt[1:5,])
##   sample strain  set1  rrp6 time   gene count norm norm_count
## 1   WT:2     WT FALSE FALSE    2 SUT476     8   92 0.08695652
## 2   WT:3     WT FALSE FALSE    3 SUT476    10  503 0.01988072
## 3   WT:4     WT FALSE FALSE    4 SUT476    28  989 0.02831143
## 4   WT:5     WT FALSE FALSE    5 SUT476    23 1236 0.01860841
## 5   WT:6     WT FALSE FALSE    6 SUT476    31 1122 0.02762923
##   log_norm_count   .fitted   .se.fit
## 1      -3.518249 -5.041789 0.4865111
## 2      -5.629393 -4.953660 0.4044709
## 3      -5.126217 -4.865530 0.3337439
## 4      -5.723242 -4.777401 0.2829451
## 5      -5.161007 -4.689271 0.2638477
# Let's look at the model fit, and how influential each observation was
augment(model, sut476_wt) %>%
ggplot(aes(x=time, y=log_norm_count)) + 
    geom_point(aes(size=.cooksd), alpha=0.5) +
    geom_line(aes(y=.fitted))

We see the initial timepoint in the wildtype strain is wildly off the linear trend. The large Cook’s Distance indicates this is strongly affecting the model. Possibly early time points need to be removed, or (more likely!) a linear trend is not a good model of this data.

Linear models can be much more complicated than this. They can include multiple explanatory variables (including categorical variables), interactions between them, and even polynomials.

model2 <- lm(log_norm_count ~ strain*poly(time,3), data=sut476)
model2
## 
## Call:
## lm(formula = log_norm_count ~ strain * poly(time, 3), data = sut476)
## 
## Coefficients:
##                    (Intercept)                      strainSET1  
##                        -4.6893                          1.4081  
##                     strainRRP6                 strainSET1-RRP6  
##                         4.2133                          3.5053  
##                 poly(time, 3)1                  poly(time, 3)2  
##                         1.3653                          2.8553  
##                 poly(time, 3)3       strainSET1:poly(time, 3)1  
##                        -2.4314                          2.0542  
##      strainRRP6:poly(time, 3)1  strainSET1-RRP6:poly(time, 3)1  
##                        -2.0608                         -0.5208  
##      strainSET1:poly(time, 3)2       strainRRP6:poly(time, 3)2  
##                        -3.9125                         -2.9297  
## strainSET1-RRP6:poly(time, 3)2       strainSET1:poly(time, 3)3  
##                        -2.1218                          1.9517  
##      strainRRP6:poly(time, 3)3  strainSET1-RRP6:poly(time, 3)3  
##                         2.1368                          2.2349
augment(model2, sut476) %>%
ggplot(aes(x=time, y=log_norm_count, color=strain, group=strain)) + 
    geom_point(aes(size=.cooksd), alpha=0.5) +
    geom_line(aes(y=.fitted))

sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] broom_0.4.1   viridis_0.3.4 ggplot2_2.1.0 dplyr_0.5.0   tidyr_0.5.1  
## [6] readr_0.2.2  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5      knitr_1.13       magrittr_1.5     mnormt_1.5-4    
##  [5] munsell_0.4.3    lattice_0.20-33  colorspace_1.2-6 R6_2.1.2        
##  [9] stringr_1.0.0    plyr_1.8.4       tools_3.3.0      parallel_3.3.0  
## [13] grid_3.3.0       nlme_3.1-128     gtable_0.2.0     psych_1.6.6     
## [17] DBI_0.4-1        htmltools_0.3.5  lazyeval_0.2.0   yaml_2.1.13     
## [21] assertthat_0.1   digest_0.6.9     tibble_1.1       gridExtra_2.2.1 
## [25] reshape2_1.4.1   formatR_1.4      evaluate_0.9     rmarkdown_1.0   
## [29] labeling_0.3     stringi_1.1.1    methods_3.3.0    scales_0.4.0